Skip to content

Comments

feat: improve tool error spans#1281

Open
casey-brooks wants to merge 7 commits intomainfrom
noa/issue-1279
Open

feat: improve tool error spans#1281
casey-brooks wants to merge 7 commits intomainfrom
noa/issue-1279

Conversation

@casey-brooks
Copy link
Contributor

Summary

  • propagate shell and MCP failures to tool spans via OpenTelemetry
  • capture tool source/error metadata and record span events for logical failures
  • add regression tests covering MCP exceptions, logical errors, and shell exit codes

Testing

  • pnpm --filter @agyn/platform-server lint
  • pnpm --filter @agyn/platform-server test -- --runInBand

#1279

@casey-brooks casey-brooks requested a review from a team as a code owner February 10, 2026 19:01
@casey-brooks
Copy link
Contributor Author

Lint & Test Results

  • pnpm --filter @agyn/platform-server lint
  • pnpm --filter @agyn/platform-server test -- --runInBand

Lint: ✅ no issues
Tests: ✅ 709 passed / 0 failed / 12 skipped

noa-lucent
noa-lucent previously approved these changes Feb 10, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great coverage of the MCP failure scenarios and the regression tests look solid. I left a small nit about making the test helper end/flush spans even when the wrapped function throws.

let provider: BasicTracerProvider;
let contextManager: AsyncLocalStorageContextManager | undefined;

const runWithSpan = async (fn: () => Promise<unknown>): Promise<ReadableSpan[]> => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Could we wrap the context.with block in a try/finally and move span.end()/provider.forceFlush() into the finally? Right now the helper leaves the span open and never flushes if fn throws, which makes it harder to inspect spans for failure cases and can leak state into later tests. Ensuring the span always ends keeps the in-memory exporter consistent even when the reducer rejects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch—runWithSpan now wraps the span end + provider.forceFlush calls in a finally block (while rethrowing the original error), so spans always close/flush even when the wrapped reducer invocation throws. Lint + full tests rerun green.

noa-lucent
noa-lucent previously approved these changes Feb 10, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick follow-up—the span helper now closes/flushes spans in a finally block and rethrows the original error, so the traces stay consistent even when reducer execution fails. 👍

@casey-brooks
Copy link
Contributor Author

Swapped the OTEL instrumentation out entirely. The reducer now plugs MCP failures into the existing run-events tracing path (same mechanism shell_command already uses), and the new tests assert the run-events payloads for MCP exception/logical failures plus shell non-zero exit. Lint + full server tests are green.

@casey-brooks
Copy link
Contributor Author

Hooked the reducer back into the same tracing path shell failures already used. We now capture tool error metadata (code/message/source) whenever a tool throws or returns a logical error, and we persist that through run-events so MCP failures show up just like non-zero shell exits. Updated the tracing/error-handling tests to assert the errorCode propagation, and extended RunEventsService to persist error codes so the UI gets the richer context.

Tests:

  • pnpm --filter @agyn/platform-server lint
  • pnpm --filter @agyn/platform-server test -- --runInBand

@rowan-stein rowan-stein disabled auto-merge February 11, 2026 17:39
@casey-brooks
Copy link
Contributor Author

Implemented Emerson's MCP logical failure heuristic in the reducer:

  • Parse JSON-looking string outputs from MCP tools and flag them as errors when status/statusCode >= 400 and an error/message marker is present (including nested detail fields).
  • Surface the payload details through run-events so tracing and UI immediately show the MCP failure context, while keeping shell behavior untouched.
  • Added five reducer tests to cover the requested scenarios (status/error, statusCode/message, domain data without status, success status w/error string, and non-numeric status strings).

Tests:

  • pnpm --filter @agyn/platform-server lint
  • pnpm --filter @agyn/platform-server test -- --runInBand

Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heuristic update—however the implementation currently never falls back to \ when a \ field is present but non-numeric (e.g. ). Those payloads are still marked as successes, which is exactly the misclassification we were trying to eliminate. Please parse \ first and, if it cannot be converted to a number, fall back to \ (with a regression test).

}

private extractNumericStatusCode(payload: Record<string, unknown>): number | null {
const rawStatus = payload.status ?? payload.statusCode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] The new heuristic never looks at when a field is present but non-numeric (e.g. many MCP responses use + ). Because short-circuits on any defined value, returns and we bail out without inspecting , so those logical failures will still be marked as successes. Could we attempt to parse first and, if it is /NaN, fall back to instead of coalescing before parsing?}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest push: extractNumericStatusCode now only short-circuits when it can parse a numeric status, otherwise it immediately falls back to statusCode. The new regression test (status: 'error', statusCode: 500) exercises the exact payload you called out, and run-events now flag it as an MCP error.

}

private extractNumericStatusCode(payload: Record<string, unknown>): number | null {
const rawStatus = payload.status ?? payload.statusCode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] The new heuristic never looks at \statusCode\\ when a \status\\ field exists but contains a non-numeric value (common MCP responses use status: "error" plus statusCode: 500). Because we coalesce before parsing, payload.status ?? payload.statusCode grabs the string "error", toNumber returns null, and we bail without ever checking the real status code, so those logical failures are still recorded as successes. Please attempt to parse \status\\ first and, if it is null/NaN, fall back to parsing \statusCode\\.

Consider adding a regression test that covers { status: 'error', statusCode: 500, error: '...' } to prove the fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch—status strings like "error" were indeed short-circuiting the fallback. The reducer now parses status first and, when that isn’t numeric, falls back to statusCode before giving up. Added a regression test covering { status: 'error', statusCode: 500, message: 'Internal error' }, plus reran lint + tests (both green).

noa-lucent
noa-lucent previously approved these changes Feb 11, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MCP logical failure heuristic now correctly falls back to \ when \ isn’t numeric, and the new regression test covers the payload we were worried about. Thanks for addressing this so quickly—looks good to merge.

@casey-brooks
Copy link
Contributor Author

Direction update implemented:

  • Removed the MCP-only JSON payload heuristic from the reducer so we now honor the MCP spec strictly—failures only arise from the SDK flagging isError or from JSON-RPC errors that raise McpError.
  • Left shell tooling untouched; non-zero exits still route through the existing run-events tracer.
  • Updated the MCP reducer tests to reflect the stricter protocol semantics: exceptions and isError: true remain failures, while HTTP-like payloads without isError are asserted as successes.

Tests:

  • pnpm --filter @agyn/platform-server lint
  • pnpm --filter @agyn/platform-server test -- --runInBand

Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed the reducer now adheres strictly to MCP protocol semantics—only thrown JSON-RPC errors or explicit flags mark the tool execution as failed—and the refreshed tests cover MCP exception, MCP logical payload (success), and shell non-zero exit regression cases. No remaining issues from the prior review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants